CLN: use integer parsing functions from stdlib #62658
Conversation
I think this is much more difficult and less performant than it needs to be because we have a suboptimal structure for the parser.
Instead of storing a char **words member, is there a way that we can define a struct like:
```c
struct PandasParsedString {
  const char *str;
  size_t len;
  int64_t is_separator[2];
};
```
and then make the member in the parser a PandasParsedString *words array. As you parse each word you can assign these struct members once, instead of having to query them after the fact.
I'm assuming the is_separator member is a bit-packed struct signifying whether a particular character is a separator or not (see other comment about a 128 character limit). We could implement this a few other ways, but I figure this is the most memory efficient.
With that information up front you can avoid having to do a character search for separators here way after the fact; if there are no separators you can avoid looping altogether, but if there are you can move a region of characters in batches to the local buffer rather than having to go char by char
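For illustration, a batched copy driven by that pre-computed information might look roughly like this (a sketch assuming the hypothetical PandasParsedString above; none of these names are the PR's actual code):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Hypothetical layout from the suggestion above (illustrative only).
typedef struct {
  const char *str;
  size_t len;
  int64_t is_separator[2]; // bit i set => str[i] is a thousands separator (i < 128)
} PandasParsedString;

// Test the bit-packed separator mask for position i.
static bool is_sep_at(const PandasParsedString *w, size_t i) {
  if (i >= 128) {
    return false; // the mask only covers the first 128 characters
  }
  return (w->is_separator[i / 64] >> (i % 64)) & 1;
}

// Copy the word into `out`, moving each separator-free run in one memcpy
// instead of going character by character.
static void copy_without_separators(const PandasParsedString *w, char *out) {
  size_t pos = 0;
  size_t start = 0;
  for (size_t i = 0; i <= w->len; i++) {
    if (i == w->len || is_sep_at(w, i)) {
      memcpy(out + pos, w->str + start, i - start); // whole run at once
      pos += i - start;
      start = i + 1;
    }
  }
  out[pos] = '\0';
}
```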
```c
#define ERROR_NO_DIGITS 1
#define ERROR_OVERFLOW 2
#define ERROR_INVALID_CHARS 3
#define ERROR_NO_MEMORY 4
```
There's already a PARSER_OUT_OF_MEMORY define in this header - does this provide anything different?
I missed it. Since we will avoid using dynamic memory, I will delete it soon and the memory error won't be applicable.
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
  }
}

char *start = malloc((chars_to_copy + 1) * sizeof(char));
```
For performance reasons we should avoid using the heap as much as possible. Given this only applies to numeric digits, I think it would be better to just stack allocate an array of a certain size.
Trying to be forward looking, we can probably get away with a local buffer of 128 bytes, since that covers the maximum character length of Arrow's Decimal256 (76) and rounds up to a power of 2. Anything longer than that can reasonably not be parsed as numeric.
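As a rough sketch of that idea (the macro name and exact size are placeholders, not the PR's code):

```c
#include <string.h>

#define PROCESSED_WORD_CAPACITY 128 // 76-digit Decimal256 fits with room to spare

static void parse_numeric_word(const char *word) {
  char buffer[PROCESSED_WORD_CAPACITY]; // stack allocation: no malloc/free, no error path
  size_t len = strlen(word);
  if (len >= PROCESSED_WORD_CAPACITY) {
    return; // too long to plausibly be numeric; treat as a parse failure
  }
  memcpy(buffer, word, len + 1); // the real copy would also strip tsep
  // ... hand `buffer` to an integer-parsing routine ...
  (void)buffer;
}
```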
For performance reasons we should avoid using the heap as much as possible. Given this only applies to numeric digits, I think it would be better to just stack allocate an array of a certain size.
Question: The approach of using a stack will not be thread safe. Is this a problem?
Another approach that I was thinking of was to use a do-while loop, verifying whether it stopped at tsep. It will complicate the code a little bit, but it will be thread safe and won't need dynamic memory.
Every thread gets its own stack, so the stack solution is thread safe while the heap is not. Perhaps you are mixing stack/local variables with the concept of static?
Of course we aren't doing any multithreaded parsing here, but that's a great question and thing to consider
Every thread gets its own stack, so the stack solution is thread safe while the heap is not.
Thanks, I didn't know that.
Perhaps you are mixing stack/local variables with the concept of static?
I first thought that I should make the array a global variable; that's the reason for the confusion.
I created a buffer of size 128 on the stack to remove tsep. With these changes, I removed the malloc call and the memory error.
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
 * @param p_item Pointer to verify
 * @return Non-zero integer if it has a digit, 0 otherwise.
 */
static inline int has_digit_int(const char *str) {
```
```diff
-static inline int has_digit_int(const char *str) {
+static inline bool has_digit_int(const char *str) {
```
We are far enough removed from C89 that we can use the bool type :-)
Done in 87789e6
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
/* Copy a string without `char_to_remove` into `output`,
 * while ensuring it's null terminated.
 */
static void copy_string_without_char(char *output, const char *str,
```
```diff
-static void copy_string_without_char(char *output, const char *str,
+static void copy_string_without_char(char output[PROCESSED_WORD_CAPACITY], const char *str,
```
Don't think you need to pass this capacity as a separate argument, since it is always the same value
Done in 2287944
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
      output[i++] = *src;
    }
  }
  if (i < output_size) {
```
This is pretty wasteful to continually write null bytes. You could either memset the buffer before you send it, or ideally short circuit when you've processed all the necessary bytes
continually write null bytes
Actually, it wasn't continually writing null bytes. Just once after copying src. Anyway, I changed to use memset.
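For reference, a reconstruction of how the single-terminator variant could look (pieced together from the fragments quoted in this thread, so not necessarily the exact committed code):

```c
#include <stddef.h>

#define PROCESSED_WORD_CAPACITY 128 // as discussed above; illustrative

// Copy `str` into `output`, dropping `char_to_remove`, and terminate exactly
// once instead of padding the rest of the buffer with null bytes.
static void copy_string_without_char(char output[PROCESSED_WORD_CAPACITY],
                                     const char *str, char char_to_remove) {
  size_t i = 0;
  for (const char *src = str;
       *src != '\0' && i < PROCESSED_WORD_CAPACITY - 1; src++) {
    if (*src != char_to_remove) {
      output[i++] = *src;
    }
  }
  output[i] = '\0'; // single terminating write
}
```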
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
static void copy_string_without_char(char *output, const char *str,
                                     char char_to_remove, size_t output_size) {
  size_t i = 0;
  for (const char *src = str; *src != '\0' && i < output_size; src++) {
```
It would be great to avoid character-by-character writes if we can, especially since this is a relatively sparse character to search for.
You might be limited with what you can do without some of the larger changes I suggested, but maybe even just finding a region and writing multiple bytes at once will be more performant.
Perhaps something like:
```c
size_t pos = 0;
const char *oldptr = p_item;
const char *newptr;
while ((newptr = strchr(oldptr, tsep)) != NULL) {
  size_t len = newptr - oldptr;
  memcpy(output + pos, oldptr, len);
  oldptr = newptr + 1;
  pos += len;
}
// copy whatever follows the last separator, then terminate
size_t tail = strlen(oldptr);
memcpy(output + pos, oldptr, tail);
output[pos + tail] = '\0';
```
Might be an off by one and probably worth bounds checking. You might also want to use strchrnul instead of strchr to avoid a buffer overrun.
Just some brief unchecked thoughts - hope they help
I couldn't make strchr work, because I couldn't know where the null terminator would be. I tried your suggestion in 2bea3c2.
I don't understand the errors on macOS, but they are definitely related to the changes here. They weren't happening when I was using malloc. I will try a different solution that doesn't rely on transforming the string.
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
    return 0;
  }
}
static inline int64_t add_int_check_overflow(int64_t lhs, int64_t rhs,
```
There's no need to implement custom overflow detection. We already have macros that defer to compiler builtins (see the np_datetime.c module). You can refactor those if needed to make available here.
FWIW C23 also standardizes checked overflow functions. I'm not sure how far along MSVC is in implementing that, but for other platforms we can likely use the standard on newer compilers.
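For illustration, the builtin and its C23 counterpart look roughly like this (a sketch; the exact macro names used in np_datetime.c may differ):

```c
#include <stdbool.h>
#include <stdint.h>

// gcc/clang builtin that the existing pandas overflow macros defer to:
static bool add_overflows(int64_t a, int64_t b, int64_t *out) {
  return __builtin_add_overflow(a, b, out); // true means the addition overflowed
}

// C23 standardizes the same idea:
//   #include <stdckdint.h>
//   bool overflowed = ckd_add(&result, a, b);
```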
The pinned meson-python version doesn't allow compiling with std=c23. I don't know why stdckdint.h worked on my system with c11.
I will use the macro in np_datetime.c to check for overflow, but I will move it to the header.
I am still using my implementation for uint overflow, because the NumPy implementation seems specific to int64 (at least for Windows).
The pinned meson-python version doesn't allow compiling with std=c23. I don't know why stdckdint.h worked on my system with c11.
You can specify multiple standards with Meson, and it will pick the first that the compiler supports. So you can just set the option c_std=c23,c11.
I don't know why stdckdint.h worked on my system with c11
While stdckdint.h may not have been standardized until after c11, that doesn't exclude your compiler from using it if it has already implemented it.
I am still using my implementation for uint overflow, because the NumPy implementation seems specific to int64 (at least for Windows).
It's unfortunate the MSVC implementation doesn't use a generic macro like the gcc/clang versions do to make these work regardless of size. However, there's the ULongLongAdd function you can use instead of your own implementation:
https://learn.microsoft.com/en-us/windows/win32/api/intsafe/nf-intsafe-ulonglongadd
Generally opt for the compiler built-ins when available; they are almost assuredly faster, and mean less code we have to maintain :-)
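For reference, an intsafe.h-based helper might look like this (Windows-only sketch, untested here; the wrapper name is made up):

```c
#if defined(_WIN32)
#include <windows.h>
#include <intsafe.h>
#include <stdbool.h>

// Checked unsigned 64-bit addition via MSVC's intsafe.h.
static bool uadd_overflows(unsigned long long a, unsigned long long b,
                           unsigned long long *out) {
  return FAILED(ULongLongAdd(a, b, out)); // failure HRESULT signals overflow
}
#endif
```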
You can specify multiple standards with Meson, and it will pick the first that the compiler supports. So can just set the option of c_std=c23,c11
I got this error
```
../../meson.build:2:0: ERROR: Value "c11,c23" (of type "string") for combo option "C language standard to use" is not one of the choices. Possible choices are (as string): "none",
"c89", "c99", "c11", "c17", "c18", "c2x", "gnu89", "gnu99", "gnu11", "gnu17", "gnu18", "gnu2x".
```
It needs meson>=1.4
```c
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#endif // NPY_NO_DEPRECATED_API

#if defined(_WIN32)
```
It's probably best to move these to the include/pandas/portable.h header
```c
_Static_assert(0,
               "Overflow checking not detected; please try a newer compiler");
#endif
// __has_builtin was added in gcc 10, but our muslinux_1_1 build environment
```
Not sure why this is a separate branch down here, but I think you can combine with the macros on lines 51/52
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
    return true;
  case '+':
  case '-':
    return isdigit_ascii(str[1]);
```
This will violate bounds checking if you pass a string of + or -
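For what it's worth, a variant that steps over the sign instead of indexing past it could look like this (illustrative; the standard isdigit stands in for the tokenizer's isdigit_ascii):

```c
#include <ctype.h>
#include <stdbool.h>

// Illustrative: skip an optional sign, then test exactly one character, so a
// bare "+" or "-" only ever reads up to its terminating null byte.
static bool has_digit_int(const char *str) {
  if (*str == '+' || *str == '-') {
    str++;
  }
  return isdigit((unsigned char)*str) != 0;
}
```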
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
  return *str == '\0';
}

static int power_int(int base, int exponent) {
```
Why is int the right return type for this?
Mainly because I expect the result to be a small power of 10, up to 10**6
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
    exponent /= 2;
  }

  return result * base;
```
This looks like it could overflow easily
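If a power helper were kept at all, an overflow-aware variant built on the builtin discussed earlier might look like this (sketch, not the PR's code):

```c
#include <stdbool.h>
#include <stdint.h>

// Exponentiation by squaring that reports overflow instead of wrapping.
static bool power_int64_checked(int64_t base, int exponent, int64_t *out) {
  int64_t result = 1;
  while (exponent > 0) {
    if (exponent & 1) {
      if (__builtin_mul_overflow(result, base, &result)) {
        return false; // result would not fit in int64_t
      }
    }
    exponent >>= 1;
    if (exponent > 0 && __builtin_mul_overflow(base, base, &base)) {
      return false;
    }
  }
  *out = result;
  return true;
}
```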
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
        ((number == pre_max) && (d - '0' <= dig_pre_max))) {
      number = number * 10 + (d - '0');
      d = *++p;
      errno = 0;
```
Do we need to check that the errno isn't set already before just clearing?
It doesn't seem necessary; this is just parsing a new number, and error handling is done in parsers.pyx.
I think you can remove this now, right? Generally I'd advise against clearing errno without good reason. This currently is equivalent to doing a try ... except Exception in Python and blindly continuing along, so it should be avoided
The errno should be reset. I added a comment explaining why. Mainly because strtoll assigns to it, and I want to reset it before the call so that parsing of previous words doesn't pollute the verification below.
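For reference, the usual discipline around strtoll is to clear errno immediately before the call and inspect it immediately after, so nothing left over from a previous word can leak into the check (sketch):

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

// Returns true if strtoll reported overflow/underflow for this word.
static bool strtoll_overflowed(const char *str, long long *out) {
  char *endptr;
  errno = 0;                      // cleared right before the call that sets it
  *out = strtoll(str, &endptr, 10);
  return errno == ERANGE;         // checked right after, before anything else runs
}
```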
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
  d = *++p;
  while (errno == 0 && tsep != '\0' && *endptr == tsep) {
    // Skip multiple consecutive tsep
    while (*endptr == tsep) {
```
Maybe I am misreading, but it's strange to see this in the invariant here and directly preceding it. Can we consolidate that logic somehow?
Another option is to move endptr to the next character and make sure it's a digit; if it's not, exit the loop.
That will probably read easier
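One possible shape for that consolidation (purely illustrative; the standard isdigit stands in for the tokenizer's isdigit_ascii):

```c
#include <ctype.h>

// Skip a run of thousands separators, but only commit to it when a digit
// follows; otherwise return the original position so the caller stops parsing.
static const char *skip_tsep(const char *p, char tsep) {
  const char *q = p;
  while (*q == tsep) {
    q++;
  }
  return isdigit((unsigned char)*q) ? q : p;
}
```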
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
    return 0;
  }
}
char *new_end = NULL;
```
No need to assign NULL
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
  while (isspace_ascii(*p)) {
    ++p;
  ptrdiff_t digits = new_end - endptr;
  int64_t mul_result = power_int(10, (int)digits);
```
Ah OK - I think you are assembling this in pieces. However, I'd still suggest just getting rid of the separators in a local buffer and then just calling strtoll (or whatever function) on the whole buffer. There's a lot of nuance to range/value handling that it's not worth reimplementing.
The problem with the local buffer is that there were failing tests on macOS that I couldn't reproduce. The malloc solution worked, but the local buffer on the stack was failing.
That sounds like you were hitting some undefined behavior, but that would come from whatever the implementation was, not the general solution. We should definitely go back to that route and just see where we got stuck
Ok. I will address some of your comments here, commit, and then revert to that state.
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
  number = number * 10 + (d - '0');
  d = *++p;
  errno = 0;
  char *endptr = NULL;
```
Don't assign NULL
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
  // Skip leading spaces.
  while (isspace_ascii(*p)) {
    ++p;
  if (!p_item || *p_item == '\0') {
```
Is it necessary to add this? I don't think we should be special-casing behavior for null or the null byte; I expect the former is undefined (since it's not a string) and the latter would be handled naturally by the rest of the logic (?)
Indeed, not necessary. The null byte is already handled below.
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
  // Skip trailing spaces.
  while (isspace_ascii(*p)) {
    ++p;
  if (errno == ERANGE) {
```
This is lost in a refactor right? I don't think errno will be set by anything preceding this?
Indeed, I forgot to remove it.
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
  // Did we use up all the characters?
  if (*p) {
    char *endptr = NULL;
```
```diff
-    char *endptr = NULL;
+    char *endptr;
```
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
  char *endptr = NULL;
  // strtoll sets errno if it finds an overflow.
  // Its value is reset so it doesn't pollute the verification below.
  errno = 0;
```
As a matter of convention, we need to check errno and "raise" if it's set before clearing it.
By "raise" you mean printing to stderr?
By "raise" I mean follow whatever exception handling is implemented. In this particular function, it looks like you should set the error variable and return 0 when an exception is encountered
You might want to give the CPython error handling doc a read, which most of the extensions base their methodology off of:
https://docs.python.org/3/c-api/exceptions.html#exception-handling
I am now resetting errno after handling the overflow error
pandas/_libs/src/parser/tokenizer.c
Outdated
```c
  // Skip leading spaces.
  while (isspace_ascii(*p)) {
    ++p;
  while (isspace_ascii(*p_item)) {
```
I think you removed the const char *p = p_item assignment as a cleanup and not for any type of functionality, but if that's true it's inflating the diff. We should move cleanup items like that to a separate PR.
I've made several changes that complicate the diff a lot. I'll put back the const char *p assignment, and also remove the need for some of the auxiliary functions I created.
Great - thanks!
It seems that … Additionally, I think it would need other members for other parsers - at least for float, that would require the …
In C technically all "strings" are null-terminated; it's just not a requirement that a …
Probably worth a dedicated issue to discuss design ideas
WillAyd left a comment
lgtm. @mroeschke anything to add here?
```c
    }
  }
}
if (errno == ERANGE || number > int_max || number < int_min) {
```
```diff
-if (errno == ERANGE || number > int_max || number < int_min) {
+if (errno == ERANGE || number > int_max || number < int_min) {
```
Are you doing a follow up to get rid of the int_max and int_min arguments? It's strange to have this in the diff given neither number > INT64_MAX nor number < INT64_MIN will ever be true, but I suppose within the larger context of there being an unnecessary argument for those I can see why you kept it.
Are you doing a follow up to get rid of the int_max and int_min arguments?
I intend to. I've left it in the verification so that the function wouldn't contain unused arguments.
| ("2,334", 2334), | ||
| ("-2,334", -2334), | ||
| ("-2,334,", -2334), | ||
| ("2,,,,,,,,,,,,,,,5", 25), |
I think it's good to have this for test coverage, but I'm doubtful it was intended even on main. Maybe we can add a comment that this is just to make sure our parser code is sound, but not necessarily to always ensure 2,,,,,,,,5 is parsed as 25?
Thanks @Alvaro-Kothe
Co-authored-by: William Ayd <[email protected]>
This was a suggestion from @WillAyd in #62623 (comment)
This simplifies the current parsing logic by utilizing existing functions from stdlib (strtoll and strtoull) to parse integers instead of using our own.
One disadvantage of this approach is that when there is a thousands separator (tsep) we allocate memory on the heap and process the existing string to remove this separator.
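For readers skimming the thread, a condensed sketch of the approach the discussion converges on — strip tsep into a fixed-size stack buffer, then let strtoll do the heavy lifting (illustrative names and error codes; the real implementation lives in pandas/_libs/src/parser/tokenizer.c):

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define PROCESSED_WORD_CAPACITY 128 // illustrative stack-buffer size
#define ERROR_NO_DIGITS 1
#define ERROR_OVERFLOW 2
#define ERROR_INVALID_CHARS 3

static int64_t parse_int64_with_tsep(const char *str, char tsep, int *error) {
  char buffer[PROCESSED_WORD_CAPACITY];
  size_t i = 0;

  // Copy the word, dropping thousands separators.
  for (const char *src = str; *src != '\0'; src++) {
    if (*src == tsep) {
      continue;
    }
    if (i >= PROCESSED_WORD_CAPACITY - 1) {
      *error = ERROR_INVALID_CHARS; // too long to be a plausible integer
      return 0;
    }
    buffer[i++] = *src;
  }
  buffer[i] = '\0';

  // Let stdlib handle digits, sign, and range checking.
  char *endptr;
  errno = 0;
  long long value = strtoll(buffer, &endptr, 10);
  if (errno == ERANGE) {
    *error = ERROR_OVERFLOW;
    return 0;
  }
  if (endptr == buffer) {
    *error = ERROR_NO_DIGITS;
    return 0;
  }
  if (*endptr != '\0') {
    *error = ERROR_INVALID_CHARS; // trailing non-numeric characters
    return 0;
  }
  *error = 0;
  return (int64_t)value;
}
```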